Reproducible research
Stephen J Eglen
Encouraging code sharing in academia
Cambridge Computational Biology Institute, University of Cambridge
https://sje30.github.io
sje30@cam.ac.uk @StephenEglen
Slides: http://bit.ly/eglen-futurepub
Acknowledgements
Co-authors, Freeman lab, Laurent Gatto.
These slides are available under a Creative Commons CC-BY license.
Inverse problems are hard
| Score | Grade |
|---|---|
| 70-100 | A |
| 60-69 | B |
| 50-59 | C |
| 40-49 | D |
| 0-39 | F |
Forward problem
I scored 68, what was my grade?
Inverse problem
I got a B, what was my score?
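A minimal sketch of the asymmetry, using the grade boundaries from the table. The function names `grade_for` and `scores_for` are illustrative, not from the talk. The forward direction returns a single answer; the inverse can only return a set of candidates.

```python
# Grade boundaries from the table: (lowest score, highest score, grade).
GRADE_BANDS = [
    (70, 100, "A"),
    (60, 69, "B"),
    (50, 59, "C"),
    (40, 49, "D"),
    (0, 39, "F"),
]


def grade_for(score: int) -> str:
    """Forward problem: each score maps to exactly one grade."""
    for low, high, grade in GRADE_BANDS:
        if low <= score <= high:
            return grade
    raise ValueError(f"score out of range: {score}")


def scores_for(grade: str) -> range:
    """Inverse problem: a grade only narrows the score down to a range."""
    for low, high, band_grade in GRADE_BANDS:
        if band_grade == grade:
            return range(low, high + 1)
    raise ValueError(f"unknown grade: {grade}")


print(grade_for(68))         # B
print(len(scores_for("B")))  # 10 candidate scores, 60 to 69
```

Knowing the grade leaves ten possible scores; nothing in the grade alone says which one is right.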
Research sharing: the inverse problem
Where is the scholarship?
An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and that complete set of instructions that generated the figures.
[Buckheit and Donoho 1995, after Claerbout]
Moral or selfish approach?
Selfish reasons to share
Why not align what is good for science with what is good for scientists?
- Funding mandates (REF + enforcement from Wellcome Trust)
- Credit through data papers
- Leads to further collaborations (e.g. “EPAmeadev”)
- Fixes data bugs / errors in analysis
- Prevents data loss (Vines et al. 2014), e.g. students have a habit of leaving…
- Your future self is probably one of the main beneficiaries of sharing.
- Now is a very good time to be an open scientist.
Code sharing: a way forward
Specific recommendations
- Include enough code to reproduce a key figure/result from your paper (“modeldb”).
- Provide toy examples if your project is too intensive to expect others to run in a few hours.
- Version control (github)
- Licence (MIT)
- Provide data
- Provide tests
- Use standards
- Use permanent URLs (Zenodo/figshare)
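As one possible illustration of the "toy examples" and "tests" recommendations, here is a sketch in Python. Everything in it is hypothetical: `make_toy_data` and `analyse` stand in for a real dataset and the analysis behind a key figure. The point is only that a small, seeded example plus a couple of checks lets someone else confirm the code runs in seconds.

```python
import math
import random
import statistics


def make_toy_data(n: int = 100, seed: int = 1) -> list[float]:
    """Hypothetical toy dataset, small enough to run almost instantly."""
    rng = random.Random(seed)  # fixed seed so every run gives the same data
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def analyse(data: list[float]) -> dict[str, float]:
    """Hypothetical stand-in for the analysis behind a key figure."""
    return {"mean": statistics.fmean(data), "sd": statistics.stdev(data)}


def test_analysis_is_reproducible() -> None:
    """Two independent runs with the same seed must agree exactly."""
    assert analyse(make_toy_data()) == analyse(make_toy_data())


def test_analysis_on_known_input() -> None:
    """Hand-checkable case: the mean of 1, 2, 3 is 2 and the sample sd is 1."""
    result = analyse([1.0, 2.0, 3.0])
    assert math.isclose(result["mean"], 2.0)
    assert math.isclose(result["sd"], 1.0)


if __name__ == "__main__":
    test_analysis_is_reproducible()
    test_analysis_on_known_input()
    print("all tests passed")
```

The functions named `test_...` follow the convention that test runners such as pytest discover automatically, but the script also runs on its own with plain `python`.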
Simple example
Docker
Can bundle an entire open-source environment for others to share:
(start docker)
docker run -d -p 8787:8787 sje30/eglen2015
open http://192.168.99.100:8787/
This should launch a web page …
Jupyter notebooks
- Embed code within manuscript; figures/tables dynamically regenerated
binder = Docker + jupyter + cloud compute
- mybinder.org developed and supported by Freeman lab, Janelia Farm.
- Allows jupyter notebooks to be dynamically evaluated (not just rendered) online.
Find a code buddy
- We ask our students to submit a .Rnw file rather than a pdf. You get a zero if I can’t compile the pdf.
- So, ask someone else if they can run your code.
Third most important file in github repo
(After Arfon Smith)
- First: LICENSE
- Second: README.md
- Third: ???
Makefile
Learn Make if you don’t know it already.
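For those new to Make, its core rule is that a target is rebuilt only when it is missing or older than one of its prerequisites. The sketch below mimics that rule in Python purely as an illustration; it is not a replacement for Make, and the file names `data.csv` and `figure.pdf` are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def needs_rebuild(target: Path, prerequisites: list[Path]) -> bool:
    """Return True if target is missing or older than any prerequisite."""
    if not target.exists():
        return True
    target_mtime = target.stat().st_mtime
    return any(p.stat().st_mtime > target_mtime for p in prerequisites)


def demo() -> list[bool]:
    """Walk through three situations and record the decision for each."""
    decisions = []
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "data.csv"      # hypothetical prerequisite
        figure = Path(tmp) / "figure.pdf"  # hypothetical target
        data.write_text("x,y\n1,2\n")
        decisions.append(needs_rebuild(figure, [data]))  # no figure yet

        figure.write_text("placeholder")
        os.utime(data, (1000, 1000))    # pretend the data was written first
        os.utime(figure, (2000, 2000))  # and the figure was built afterwards
        decisions.append(needs_rebuild(figure, [data]))  # figure is up to date

        os.utime(data, (3000, 3000))    # the data changes after the build
        decisions.append(needs_rebuild(figure, [data]))  # figure is stale
    return decisions


print(demo())  # [True, False, True]
```

A Makefile expresses the same dependency (figure depends on data) declaratively, which is why it doubles as documentation of how each result is produced.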
Practical tips
- Lobby journals about their code-sharing practices.
- Lobby funders likewise.
- When reviewing articles, ask for code to be made available.
- When starting on a new project, assume code will be public at some point in the future.
Summary
- Find the selfish reasons to make your research reproducible.
- Adopt good practices to help you on your way.
- Writing code in groups can be very motivating.
- Use new tech if you want, but old tech works too.
Stephen Eglen: a brief CV
- Reader in Computational Neuroscience, DAMTP
- Director of MPhil in Computational Biology
- Member of Cambridge Computational Biology Institute (CCBI)
- Home page: http://sje30.github.io
- Mixing R and LaTeX (and recently markdown) since 2002…
- Advocate of Open Research
Reproducible papers
- Eglen et al. (2014)
- Eglen (2015)
Lessons I learnt
- Editors loved this.
- Reviewers engaged, edited code/figures.
- Brittle. Paper 2 broke within 6 months!
Code review pilot
- Nature Neuroscience began a pilot review project in June 2017 to check the reproducibility of a key figure/table. Editorial
- We wrote some guidelines for making code reproducible. Commentary
Challenges for a reproducible paper
- Technical aspects mostly there. Docker/Rmd/Jupyter notebooks/Zenodo/github.
- Sustainable compute platform required. mybinder.org was a victim of its own success. beta.mybinder.org Julia example
- Long (hours upwards) compute jobs are clearly not interactive.
- Social challenges much harder. How to incentivise/require authors to work like this?
- Publisher workflow should work with the author workflow (which starts before, and may continue after, the publisher workflow finishes).
- Will reviewers be expected to do more than read the paper?